Final Homework

Daniel Kwik

Supervised Learning

Question: Build a decision tree classifier for the type of each storm in nasaweather::storms based on its wind speed and pressure. Report its accuracy.

We will fit a decision tree classifier to this data. Let's start by inspecting the data frame and plotting a scatterplot faceted by storm type.

head(nasaweather::storms)
## # A tibble: 6 × 11
##   name     year month   day  hour   lat  long pressure  wind type        seasday
##   <chr>   <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>         <int>
## 1 Allison  1995     6     3     0  17.4 -84.3     1005    30 Tropical D…       3
## 2 Allison  1995     6     3     6  18.3 -84.9     1004    30 Tropical D…       3
## 3 Allison  1995     6     3    12  19.3 -85.7     1003    35 Tropical S…       3
## 4 Allison  1995     6     3    18  20.6 -85.8     1001    40 Tropical S…       3
## 5 Allison  1995     6     4     0  22   -86        997    50 Tropical S…       4
## 6 Allison  1995     6     4     6  23.3 -86.3      995    60 Tropical S…       4
storms <- nasaweather::storms
ggplot(data = storms, aes(x = pressure, y = wind)) +
  geom_point(data = storms %>% select(-type), alpha = .25, color = "grey") +
  geom_point(alpha = .25, position = position_jitter(.1), color = "green") +
  facet_wrap(vars(type))

Since the storm types are relatively well separated in this space, a decision tree should classify them well. We will use wind and pressure as the predictors in our model.

# Fit a classification tree on wind and pressure.
# (type is categorical, so the mode must be "classification", not "regression";
#  parsnip also requires the outcome to be a factor.)
model1 <- fit(decision_tree(mode = "classification"),
     type ~ wind + pressure,
     data = storms %>% mutate(type = as.factor(type)))
# visualize the fitted tree
model1 %>% 
  extract_fit_engine() %>% 
  rpart.plot::rpart.plot(roundint = FALSE, digits = 2, type = 5)
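The question also asks us to report the model's accuracy. A minimal sketch of that computation, assuming model1 is a classification fit and that the tidymodels packages (parsnip's predict() method and yardstick::accuracy()) are loaded; since no train/test split was made here, this is training accuracy:

```r
# Compare predicted vs. actual storm types on the training data.
# type is a character column, so convert it to a factor to match the
# factor .pred_class column returned by predict().
storms %>%
  mutate(type = as.factor(type)) %>%
  bind_cols(predict(model1, new_data = storms)) %>%
  yardstick::accuracy(truth = type, estimate = .pred_class)
```

yardstick::accuracy() returns a one-row tibble whose .estimate column is the proportion of storms classified correctly.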

Clustering

Question: Consider the 4,000 biggest cities in the world, given by:

big_cities <- mdsr::world_cities %>% 
  arrange(desc(population)) %>% 
  slice_head(n = 4000)
head(big_cities)
## # A tibble: 6 × 9
##   geoname_id name  latitude longitude country country_region population timezone
##        <dbl> <chr>    <dbl>     <dbl> <chr>   <chr>               <dbl> <chr>   
## 1    1796236 Shan…     31.2     121.  CN      23               22315474 Asia/Sh…
## 2     745044 Ista…     41.0      28.9 TR      34               14804116 Europe/…
## 3    3435910 Buen…    -34.6     -58.4 AR      07               13076300 America…
## 4    1275339 Mumb…     19.1      72.9 IN      16               12691836 Asia/Ko…
## 5    3530597 Mexi…     19.4     -99.1 MX      09               12294193 America…
## 6    1816670 Beij…     39.9     116.  CN      22               11716620 Asia/Sh…
## # … with 1 more variable: modification_date <date>

Construct a k-means clustering of the latitude and longitude of these cities. Describe (qualitatively) the results of clustering with k=2.

Let’s start by splitting our data into training and testing sets with an 80-20 split.

set.seed(12345)
big_cities_split <- initial_split(big_cities, prop = 4/5)
big_cities_train <- training(big_cities_split)
big_cities_test <- testing(big_cities_split)

head(big_cities_train)
## # A tibble: 6 × 9
##   geoname_id name  latitude longitude country country_region population timezone
##        <dbl> <chr>    <dbl>     <dbl> <chr>   <chr>               <dbl> <chr>   
## 1    2650839 Dudl…     52.5     -2.08 GB      ENG                199059 Europe/…
## 2    1805753 Jinan     36.7    117.   CN      25                4335989 Asia/Sh…
## 3    2034714 Sipi…     43.2    124.   CN      05                 555609 Asia/Sh…
## 4     515003 Oren…     51.8     55.1  RU      55                 550204 Asia/Ye…
## 5     963241 Rand…    -26.1     28.0  ZA      06                 337053 Africa/…
## 6    1859675 Kawa…     34.8    135.   JP      13                 160520 Asia/To…
## # … with 1 more variable: modification_date <date>

Next, we isolate the latitude and longitude columns and rescale each to the range 0 to 1, so that both variables contribute comparably to the Euclidean distances k-means uses. After that, we will run our k-means model with 2 centers.

cluster <- big_cities_train %>% 
  select(latitude, longitude) %>% 
  mutate(latitude = rescale(latitude, to = c(0,1))) %>% 
  mutate(longitude = rescale(longitude, to = c(0,1)))
head(cluster)
## # A tibble: 6 × 2
##   latitude longitude
##      <dbl>     <dbl>
## 1    0.865     0.468
## 2    0.735     0.825
## 3    0.789     0.847
## 4    0.859     0.639
## 5    0.222     0.558
## 6    0.720     0.880
set.seed(12345)
clustering_results <- cluster %>% 
  kmeans(nstart = 10, centers = 2)

big_cities_clusters <- big_cities_train %>% 
  mutate(cluster = as.factor(clustering_results$cluster))

Looking at our kmeans clustering results:

glance(clustering_results)
## # A tibble: 1 × 4
##   totss tot.withinss betweenss  iter
##   <dbl>        <dbl>     <dbl> <int>
## 1  267.         142.      126.     1
tidy(clustering_results)
## # A tibble: 2 × 5
##   latitude longitude  size withinss cluster
##      <dbl>     <dbl> <int>    <dbl> <fct>  
## 1    0.505     0.308  1018     60.6 1      
## 2    0.699     0.686  2182     81.1 2
head(big_cities_clusters)
## # A tibble: 6 × 10
##   geoname_id name  latitude longitude country country_region population timezone
##        <dbl> <chr>    <dbl>     <dbl> <chr>   <chr>               <dbl> <chr>   
## 1    2650839 Dudl…     52.5     -2.08 GB      ENG                199059 Europe/…
## 2    1805753 Jinan     36.7    117.   CN      25                4335989 Asia/Sh…
## 3    2034714 Sipi…     43.2    124.   CN      05                 555609 Asia/Sh…
## 4     515003 Oren…     51.8     55.1  RU      55                 550204 Asia/Ye…
## 5     963241 Rand…    -26.1     28.0  ZA      06                 337053 Africa/…
## 6    1859675 Kawa…     34.8    135.   JP      13                 160520 Asia/To…
## # … with 2 more variables: modification_date <date>, cluster <fct>

Now that the data is formatted for visualization, let's start with a scatterplot.

#plot the clusters in latitude-longitude space
latlong_plot <- ggplot(big_cities_clusters,
                       aes(x = longitude, y = latitude, color = cluster)) +
  geom_point(alpha = 0.5) +
  coord_fixed(ratio = 1)
latlong_plot

But how do we know where on the globe these cities are? Let’s overlay this scatterplot onto a world map using plotly.

#overlay scatterplots over a world map using plotly
map_plot <- plot_geo(big_cities_clusters, lat = ~latitude, lon = ~longitude, color = ~cluster) %>% 
  add_trace(marker = list(opacity = 0.7))
map_plot

As we can see from our map, k-means clustering with k=2 splits the cities roughly into two clusters: one containing North America, South America, and much of Africa, and the other containing Europe, Asia, and Australia. Africa itself is divided between the two, with the north clustered with Europe and the south clustered with the Americas.
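The held-out test cities are never used above, but they could be assigned to these clusters by nearest center. A minimal base-R sketch, using the rounded center coordinates reported by tidy(clustering_results) and assuming points are already rescaled to the same 0-1 coordinates used for clustering:

```r
# k-means centers (rescaled latitude, longitude) as reported by tidy():
centers <- rbind(c(0.505, 0.308),   # cluster 1: Americas and southern Africa
                 c(0.699, 0.686))   # cluster 2: Europe, Asia, Australia

# assign a rescaled (latitude, longitude) point to its nearest center
nearest_center <- function(lat, lon) {
  which.min(sqrt((centers[, 1] - lat)^2 + (centers[, 2] - lon)^2))
}

nearest_center(0.865, 0.468)  # Dudley, GB   -> 2 (Europe/Asia cluster)
nearest_center(0.222, 0.558)  # Randburg, ZA -> 1 (Americas cluster)
```

For the full test set one would rescale big_cities_test with the training minima and maxima before applying this function.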

Databases

Question: Which baseball players have hit 500 home runs (HR) OR 3000 hits (H) but have not (yet?) been inducted into the Baseball Hall of Fame?

First, we will load the Lahman package.

library(Lahman)

Next, we will inspect the data by looking at the structure of each table and conducting some anti-joins between them.

head(HallOfFame)
##    playerID yearID votedBy ballots needed votes inducted category needed_note
## 1  cobbty01   1936   BBWAA     226    170   222        Y   Player        <NA>
## 2  ruthba01   1936   BBWAA     226    170   215        Y   Player        <NA>
## 3 wagneho01   1936   BBWAA     226    170   215        Y   Player        <NA>
## 4 mathech01   1936   BBWAA     226    170   205        Y   Player        <NA>
## 5 johnswa01   1936   BBWAA     226    170   189        Y   Player        <NA>
## 6 lajoina01   1936   BBWAA     226    170   146        N   Player        <NA>
head(Batting)
##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4  0
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2  5
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0  2
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2  1
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0  1
##   IBB HBP SH SF GIDP
## 1  NA  NA NA NA    0
## 2  NA  NA NA NA    0
## 3  NA  NA NA NA    1
## 4  NA  NA NA NA    0
## 5  NA  NA NA NA    0
## 6  NA  NA NA NA    0
head(People)
##    playerID birthYear birthMonth birthDay birthCountry birthState  birthCity
## 1 aardsda01      1981         12       27          USA         CO     Denver
## 2 aaronha01      1934          2        5          USA         AL     Mobile
## 3 aaronto01      1939          8        5          USA         AL     Mobile
## 4  aasedo01      1954          9        8          USA         CA     Orange
## 5  abadan01      1972          8       25          USA         FL Palm Beach
## 6  abadfe01      1985         12       17         D.R.  La Romana  La Romana
##   deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1        NA         NA       NA         <NA>       <NA>      <NA>     David
## 2      2021          1       22          USA         GA   Atlanta      Hank
## 3      1984          8       16          USA         GA   Atlanta    Tommie
## 4        NA         NA       NA         <NA>       <NA>      <NA>       Don
## 5        NA         NA       NA         <NA>       <NA>      <NA>      Andy
## 6        NA         NA       NA         <NA>       <NA>      <NA>  Fernando
##   nameLast        nameGiven weight height bats throws      debut  finalGame
## 1  Aardsma      David Allan    215     75    R      R 2004-04-06 2015-08-23
## 2    Aaron      Henry Louis    180     72    R      R 1954-04-13 1976-10-03
## 3    Aaron       Tommie Lee    190     75    R      R 1962-04-10 1971-09-26
## 4     Aase   Donald William    190     75    R      R 1977-07-26 1990-10-03
## 5     Abad    Fausto Andres    184     73    L      L 2001-09-10 2006-04-13
## 6     Abad Fernando Antonio    235     74    L      L 2010-07-28 2019-09-28
##    retroID   bbrefID  deathDate  birthDate
## 1 aardd001 aardsda01       <NA> 1981-12-27
## 2 aaroh101 aaronha01 2021-01-22 1934-02-05
## 3 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 aased001  aasedo01       <NA> 1954-09-08
## 5 abada001  abadan01       <NA> 1972-08-25
## 6 abadf001  abadfe01       <NA> 1985-12-17
Batting %>% 
  anti_join(People, by = "playerID") %>% head()
##  [1] playerID yearID   stint    teamID   lgID     G        AB       R       
##  [9] H        X2B      X3B      HR       RBI      SB       CS       BB      
## [17] SO       IBB      HBP      SH       SF       GIDP    
## <0 rows> (or 0-length row.names)
Batting %>% 
  anti_join(HallOfFame, by = "playerID") %>% head()
##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4  0
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2  5
## 4 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0  1
## 5 barkeal01   1871     1    RC1   NA  1   4  0  1   0   0  0   2  0  0  1  0
## 6 barrebi01   1871     1    FW1   NA  1   5  1  1   1   0  0   1  0  0  0  0
##   IBB HBP SH SF GIDP
## 1  NA  NA NA NA    0
## 2  NA  NA NA NA    0
## 3  NA  NA NA NA    1
## 4  NA  NA NA NA    0
## 5  NA  NA NA NA    0
## 6  NA  NA NA NA    0
HallOfFame %>% 
  anti_join(Batting, by = "playerID") %>% head()
##    playerID yearID    votedBy ballots needed votes inducted          category
## 1 bulkemo99   1937 Centennial      NA     NA    NA        Y Pioneer/Executive
## 2 johnsba99   1937 Centennial      NA     NA    NA        Y Pioneer/Executive
## 3 cartwal99   1938 Centennial      NA     NA    NA        Y Pioneer/Executive
## 4 chadwhe99   1938 Centennial      NA     NA    NA        Y Pioneer/Executive
## 5 mccarjo99   1939      BBWAA     274    206     3        N           Manager
## 6 landike99   1944 Old Timers      NA     NA    NA        Y Pioneer/Executive
##   needed_note
## 1        <NA>
## 2        <NA>
## 3        <NA>
## 4        <NA>
## 5        <NA>
## 6        <NA>

These three tables are linked by the same playerID key. There are playerIDs in Batting and People that do not appear in HallOfFame; we will assume those players have not been inducted. Thus, we will combine the tables and list the players who have hit 500 home runs or 3,000 hits but have not been inducted into the Hall of Fame.

Two details matter here. Batting records one row per player per season, so filtering the raw table for HR > 500 can never match (no one hits 500 home runs in a single season); the totals must first be summed to career figures. And players who were never on a Hall of Fame ballot have no HallOfFame row at all, so filter(inducted != "Y") would silently drop them; instead we anti-join against the inducted players.

Batting %>%
  group_by(playerID) %>%
  summarize(HR = sum(HR, na.rm = TRUE), H = sum(H, na.rm = TRUE)) %>%
  filter(HR >= 500 | H >= 3000) %>%
  anti_join(filter(HallOfFame, inducted == "Y"), by = "playerID") %>%
  inner_join(People, by = "playerID") %>%
  select(nameFirst, nameLast, HR, H) %>%
  arrange(desc(HR))

Depending on the Lahman version, this list should include sluggers such as Barry Bonds, Sammy Sosa, Rafael Palmeiro, and Alex Rodriguez, along with 3,000-hit players like Pete Rose: players who reached these milestones but have not (yet) been inducted into the Hall of Fame.

Text Data

Question: How many speaking lines are there in Macbeth? Speaking lines are identified by a line that starts with two spaces, then a string of capital letters (possibly including spaces) indicating the character’s name, followed by a period.

I will start by loading the Macbeth text and splitting it into a character vector with one element per line.

library(tidyverse)
macbeth_url <- "http://www.gutenberg.org/cache/epub/1129/pg1129.txt"
data(Macbeth_raw, package = "mdsr")

#Parse lines into separate rows
macbeth <- Macbeth_raw %>% 
  stringi::stri_split_lines() %>%
  pluck(1)
length(macbeth)
## [1] 3194
head(macbeth)
## [1] "This Etext file is presented by Project Gutenberg, in"           
## [2] "cooperation with World Library, Inc., from their Library of the" 
## [3] "Future and Shakespeare CDROMS.  Project Gutenberg often releases"
## [4] "Etexts that are NOT placed in the Public Domain!!"               
## [5] ""                                                                
## [6] "*This Etext has certain copyright implications you should read!*"

Next, I will create a regex pattern and detect the regex pattern using str_subset & str_detect. I will show both the patterns detected and the full lines that contain the detected patterns.

pattern <- "^  [A-Z ]+[.]{1} "
#show the regex pattern detected to check
macbeth %>% str_subset(pattern) %>% str_extract(pattern) %>% head(10)
##  [1] "  FIRST WITCH. "  "  SECOND WITCH. " "  THIRD WITCH. "  "  FIRST WITCH. " 
##  [5] "  SECOND WITCH. " "  THIRD WITCH. "  "  FIRST WITCH. "  "  ALL. "         
##  [9] "  DUNCAN. "       "  MALCOLM. "
#show the lines detected to check
macbeth %>% str_subset(pattern) %>% head(10)
##  [1] "  FIRST WITCH. When shall we three meet again?"   
##  [2] "  SECOND WITCH. When the hurlyburly's done,"      
##  [3] "  THIRD WITCH. That will be ere the set of sun."  
##  [4] "  FIRST WITCH. Where the place?"                  
##  [5] "  SECOND WITCH. Upon the heath."                  
##  [6] "  THIRD WITCH. There to meet with Macbeth."       
##  [7] "  FIRST WITCH. I come, Graymalkin."               
##  [8] "  ALL. Paddock calls. Anon!"                      
##  [9] "  DUNCAN. What bloody man is that? He can report,"
## [10] "  MALCOLM. This is the sergeant"

Now that I have verified that our regex pattern is working correctly, I will compute the total number of lines containing the pattern.

macbeth %>% str_detect(pattern) %>% sum()
## [1] 644

There are 644 speaking lines in Macbeth.

Note: I required a trailing space at the end of the regex pattern because two lines match the pattern without it but are not speaking lines. In the following code chunk, I display those two faulty lines using the pattern without the trailing space.

pattern2 <- "^  [A-Z ]+[.]{1}"
macbeth %>% str_subset(pattern2) %>% head(2)
## [1] "  ASCII."                     "               P.O. Box 2782"
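The effect of that trailing space can be checked on a small toy vector; base R's grepl is equivalent to str_detect for this purpose:

```r
# One real speaking line, one false positive, and one non-matching line.
toy <- c("  FIRST WITCH. When shall we three meet again?",
         "  ASCII.",
         "ACT I.")
sum(grepl("^  [A-Z ]+[.]{1} ", toy))  # with trailing space: 1 (speaking line only)
sum(grepl("^  [A-Z ]+[.]{1}", toy))   # without: 2 (also matches "  ASCII.")
```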

Question: Find the 10 most popular boys’ names in 2017 that end in a vowel. Use the babynames::babynames table. (Hint: str_detect.)

First I will load the babynames package.

library(babynames)
head(babynames::babynames)
## # A tibble: 6 × 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

We want only the names of boys in 2017, so we will filter for this.

vowelboys <- babynames::babynames %>% 
  filter(year == 2017 & sex == "M")
head(vowelboys)
## # A tibble: 6 × 5
##    year sex   name         n    prop
##   <dbl> <chr> <chr>    <int>   <dbl>
## 1  2017 M     Liam     18728 0.00954
## 2  2017 M     Noah     18326 0.00933
## 3  2017 M     William  14904 0.00759
## 4  2017 M     James    14232 0.00725
## 5  2017 M     Logan    13974 0.00712
## 6  2017 M     Benjamin 13733 0.00699

We will work with the name column as a plain character vector.

vowelboys2 <- vowelboys$name

Now, we will use a regex pattern to detect all the names that end with a vowel, using str_subset first to check that it works correctly.

regex <- "[aeiou]$"  # a character class lists characters directly; commas would be literal
vowelboys2 %>% str_subset(regex) %>% head()
## [1] "Luke"   "Levi"   "Joshua" "Mateo"  "Eli"    "Leo"
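A regex caveat worth noting: inside a character class, a comma is matched literally, so a pattern like "[a,e,i,o,u]$" also matches strings ending in a comma. That is harmless for names, but "[aeiou]$" says exactly what is meant. A quick base-R check:

```r
# The comma inside the class is just another literal character to match.
grepl("[a,e,i,o,u]$", c("Leo", "Smith,", "Mark"))  # TRUE  TRUE FALSE
grepl("[aeiou]$",     c("Leo", "Smith,", "Mark"))  # TRUE FALSE FALSE
```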

Now that our regex pattern is working correctly, we will use str_detect to add a logical column indicating whether each name ends with a vowel. Then, we will filter for TRUE values, arrange by descending occurrences (the n column), and take the top 10 rows. This gives us the 10 most popular boys' names in 2017 that end in a vowel.

top10vowelboys <- vowelboys %>% 
  mutate(endVowel = str_detect(name, regex)) %>% 
  filter(endVowel) %>% 
  arrange(desc(n)) %>% 
  head(10)

The 10 most popular boys' names ending in a vowel in 2017 are Luke, Levi, Joshua, Mateo, Eli, Leo, Theodore, Ezra, Jose, and Jace.